ABSTRACT

In this paper we present a full-custom VLSI design of a high-speed 2-D DCT/IDCT processor
based on a new class of time-recursive algorithms and architectures which has not previously been
implemented to demonstrate its performance. We show that the VLSI implementation of this class of
DCT/IDCT algorithms can meet the high-speed requirements of HDTV owing to its modularity,
regularity, local connectivity, and scalability. Our design of the 8 x 8 DCT/IDCT can operate
at 50 MHz with a 400 Mbps throughput, based on a very conservative estimate under 1.2 µm CMOS
technology. In comparison to existing designs, our approach offers many advantages that can
be further explored for even higher performance.


This work was supported in part by NSF grant MIP9309506, ONR grants N00014-93-1-0566 and
N00014-93-11028, and Maryland Industrial Partnership MIPS/Micro-Star grant.


1  Introduction 


Recent  advances  in  various  aspects  of  digital  technology  have  made  possible  many  applications  of 
digital  video  such  as  HDTV,  teleconferencing,  and  multimedia  communications.  These  applications 
require  high-speed  transmission  of  vast  amounts  of  video  data.  Most  video  standards  such  as 
HDTV  video  coding,  H.261,  JPEG,  and  MPEG  use  discrete  cosine  transform  (DCT)  as  a  standard 
transform coding scheme [1, 2, 3, 4]. The DCT is, however, very computationally intensive. To realize
a high-speed and cost-effective DCT for video coding, one needs efficient VLSI implementations so
that the high throughput requirements can be met. There has been considerable recent research
on efficient mapping of these algorithms to practical and feasible VLSI implementations
[5, 6, 7, 8, 9]. These designs, however, employ irregular butterfly structures with global
communications, resulting in complex layout, timing, and reliability concerns which severely limit
the operating speed and expandability of VLSI implementations.

Recently a new class of transform coding architectures based on the time-recursive approach has
been proposed [10, 11, 12]. The complexity of this class of parallel architectures is low, e.g. only
4N - 4 multipliers are needed for computing the 2-D DCT. To perform the inverse DCT (IDCT), the
computational structure is the same, with only one additional multiplier needed [12]. Thus, the
DCT and IDCT can be naturally combined and implemented together. This class of architectures
has excellent scalability, i.e. the transform size N can be made any integer by adding or deleting
computational modules [10, 11, 12]. In addition, these architectures are highly parallel, modular,
regular, fully pipelined, and locally connected, which makes them very good candidates for
high-speed video applications. They are also well suited to real-time applications, as the
time-recursive concept has been exploited to eliminate the waiting time for data to arrive. From the
VLSI implementation point of view, as the parallel computational IIR structures are decoupled into
independent modules, the need for global communication is eliminated.

In this paper we present a novel VLSI implementation of the time-recursive 2-D DCT/IDCT
processor. The class of time-recursive parallel architectures has never been designed and
implemented to demonstrate its properties. Our goal here is to show its performance under a
full-custom VLSI implementation and to compare it with other existing VLSI designs based on different
algorithms. The chip design has been carefully optimized through an appropriate choice of wordlength
and device elements to meet the expected signal-to-noise ratio, the design of distributed arithmetic
ROM units, and the transformation and redistribution of clocking and pipeline stages to improve
the throughput. The simulation of our design of the 8 x 8 2-D DCT/IDCT shows that it can
operate at a system clock rate of 50 MHz with 400 Mbps throughput under 1.2 µm CMOS
technology†, which implies that it can perform the DCT/IDCT under the HDTV requirements.

The  paper  is  organized  in  the  following  manner.  We  give  a  brief  summary  of  the  algorithm 
and  architecture  in  Section  2.  The  finite  wordlength  and  architectural  considerations  are  given  in 
Section  3.  The  VLSI  design  and  implementation  are  detailed  in  Section  4.  A  comparison  to  existing 
DCT/IDCT  VLSI  design  is  presented  in  Section  5,  followed  by  the  conclusion  in  Section  6. 


2  Algorithm  and  Architecture 


The time-recursive two-dimensional DCT for an N x N image block {x(m, n) : m = t, t + 1, ..., t +
N - 1; n = 0, 1, ..., N - 1} is defined as [11, 12]:

$$X_c(k,l,t) = \frac{2}{N}\, C(k)\, C(l) \sum_{m=t}^{t+N-1} \sum_{n=0}^{N-1} x(m,n) \cos\frac{\pi[2(m-t)+1]k}{2N} \cos\frac{\pi(2n+1)l}{2N} \tag{1}$$

where

$$C(k) = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{if } k = 0, \\ 1 & \text{otherwise.} \end{cases}$$

The time index t in $X_c(k,l,t)$ denotes that the transform starts from x(t, n). The 2-D IDCT is defined
in a similar manner:

$$x_c(m,n) = \frac{2}{N} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} C(k)\, C(l)\, X_c(k,l) \cos\frac{\pi(2m+1)k}{2N} \cos\frac{\pi(2n+1)l}{2N} \tag{2}$$

DCT/IDCT  is  a  very  computationally  intensive  operation.  To  be  able  to  use  this  technique  for 
high-throughput  applications  such  as  HDTV  coding,  an  efficient  VLSI  implementation  is  essential. 
In  the  past  several  fast  algorithms  have  been  mapped  onto  VLSI  chips,  but  they  are  not  particularly 
well  adapted  for  VLSI  where  regularity,  modularity,  timing,  layout  complexity,  and  area  are  of  more 
concern. 

The IIR algorithm [12] for the computation of the DCT is a direct 2-D method and does not
require transposition, unlike more traditional row-column algorithms. The structure is derived by
considering the transform operation to be a filter which transforms the serial input data into their
transform coefficients.

† This result is a conservative estimate.

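The 1-D DCT and its inverse for an N-point block of input data, starting from x(t) and ending with
x(t + N - 1), are defined as:

$$X_c(k,t) = \sqrt{\frac{2}{N}}\, C(k) \sum_{n=t}^{t+N-1} x(n) \cos\left[\left(n - t + \frac{1}{2}\right)\frac{\pi k}{N}\right], \quad k = 0, 1, \ldots, N-1, \tag{3}$$

$$x_c(n,t) = \sqrt{\frac{2}{N}} \sum_{k=t}^{t+N-1} C(k-t)\, X_c(k) \cos\left[\left(n + \frac{1}{2}\right)\frac{\pi (k-t)}{N}\right], \quad n = 0, 1, \ldots, N-1, \tag{4}$$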

where  C(k)  is  as  defined  in  (1). 

The  transfer  function  for  the  forward  and  inverse  1-D  DCT  can  be  shown  to  be: 


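$$H_c(z) = \sqrt{\frac{2}{N}}\, C(k) \cos\frac{\pi k}{2N}\; \frac{\left[(-1)^k - z^{-N}\right]\left(1 - z^{-1}\right)}{1 - 2\cos\frac{\pi k}{N}\, z^{-1} + z^{-2}} \tag{7}$$

$$H_{ic}(z) = \sqrt{\frac{2}{N}} \left[ \frac{\cos\frac{(2n+1)(N-1)\pi}{2N} - \cos\frac{(2n+1)\pi}{2N}\, z^{-N} + z^{-(N+1)}}{1 - 2\cos\frac{(2n+1)\pi}{2N}\, z^{-1} + z^{-2}} + \left(\frac{1}{\sqrt{2}} - 1\right) z^{-(N-1)} \right] \tag{8}$$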


The signal flow graph (SFG) shown in Fig. 1 implements (7) or (8), i.e. the forward or the inverse
1-D DCT, depending on the multiplier coefficients and the modifications to the SFG indicated
by the dashed lines.

The kernel shown in Fig. 1 computes a single DCT channel coefficient based on the multiplier
coefficient encoded in that particular filter. N such parallel modules (each with the appropriate
multiplier coefficient corresponding to k) form a filter bank which computes the N coefficients of
the 1-D transform. Every N cycles, the 1-D transform coefficients for a new data set are computed
in parallel by the N filter bank modules. These 1-D transform coefficients are then fed into an
identical but slowed-down (N times) filter bank, which computes the 2-D transform of the N x N data
block. The block diagram for the 2-D DCT/IDCT architecture [11] is outlined in Fig. 2.
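
The filter view of the transform can be checked numerically. The following is a minimal sketch, not the SFG of Fig. 1: it assumes definition (3) with the sqrt(2/N) normalization and uses one possible direct-form second-order recursion, obtained by summing the windowed cosine kernel of (3) in closed form. The function names are illustrative only.

```python
import math

def c_scale(k):
    """Scale factor C(k): 1/sqrt(2) for k = 0, 1 otherwise."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

def dct_direct(x, k, t, n_points):
    """X_c(k, t): direct evaluation of the 1-D DCT on the window x[t .. t+N-1]."""
    total = sum(
        x[n] * math.cos((n - t + 0.5) * math.pi * k / n_points)
        for n in range(t, t + n_points)
    )
    return math.sqrt(2.0 / n_points) * c_scale(k) * total

def dct_time_recursive(x, k, n_points):
    """Stream x through a second-order recursion for channel k.

    Returns y such that y[n] equals X_c(k, n - N + 1) once n >= N - 1.
    """
    theta = math.pi * k / n_points
    gain = math.sqrt(2.0 / n_points) * c_scale(k) * math.cos(theta / 2.0)
    sign = -1.0 if k % 2 else 1.0          # (-1)^k
    two_cos = 2.0 * math.cos(theta)
    y1 = y2 = w_prev = 0.0
    out = []
    for n, sample in enumerate(x):
        leaving = x[n - n_points] if n >= n_points else 0.0
        w = sign * sample - leaving        # [(-1)^k - z^-N] applied to x(n)
        y = two_cos * y1 - y2 + gain * (w - w_prev)
        out.append(y)
        y2, y1, w_prev = y1, y, w
    return out
```

Each channel needs only the newest sample, the sample leaving the window, and two state variables, which is why a new coefficient is available as soon as the last sample of a window arrives.
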
3  Finite  Wordlength  and  Architectural  Considerations


In the realization of the DCT algorithm there are tight tradeoffs between criteria such as
accuracy, speed, and chip area. The implementation of the 2-D DCT/IDCT algorithm with
finite precision arithmetic (due to fixed register length) introduces truncation errors. To minimize
the effect of truncation errors, one needs to increase the register length, i.e. use a larger internal bus
precision. Doing so, however, not only results in larger area but also affects the speed of submodules
such as adders and multipliers. We therefore need to choose the optimum register length which, while
meeting the minimum accuracy criteria, also leads to a high-speed implementation with small
chip area.

To make sure the accumulated errors do not exceed the video coding requirements, we model
the 2-D IIR DCT/IDCT architecture in C and perform simulations to verify that the Peak and
Average SNR requirements suitable for video coding applications are met [5]. These simulations, in
conjunction with preliminary timing analysis of various submodules, helped in deciding the final
architecture suitable for a high-performance, high-speed 2-D DCT/IDCT chip. The truncation errors
which are introduced in the system are quantified by PSNR and Average SNR. The Average SNR
is defined as:


$$\mathrm{SNR} = 20 \log_{10} \frac{I(x,y)}{\left| O(x,y) - I(x,y) \right|} \tag{9}$$


where O(x, y) and I(x, y) are the output and input image pixel intensity values for position (x, y).
The Peak SNR is defined in a similar manner, with the only difference being that it focuses on the
noise introduced due to truncation, and therefore assumes the peak input intensity value.


3.1  ROM  Lookup  vs  Parallel  Multiplier 

One  of  the  critical  submodules  of  this  design  was  the  multiplier  implementation.  There  are  pros 
and  cons  of  using  distributed  arithmetic  techniques  [13]  and  parallel  multipliers  [14,  15].  ROM 
table  lookups  can  be  done  very  fast  as  compared  to  using  a  regular  multiplier.  Also  its  accuracy 
is  higher  than  regular  multipliers  as  lookup-entries  are  precomputed  (with  64-bit  precision)  and 
stored  in  the  table.  The  only  drawback  is  that,  as  a  table  lookup,  its  size  grows  exponentially  with 
input  bit  precision.  Table  1  compares  some  of  the  preliminary  ROM  designs  with  the  best  parallel 
multiplier  that  was  available  to  us. 

Type         Precision   Size             Delay
(illegible)  12 x 12     1217λ x 1532λ    15 ns
ROM          12 x 12     1292λ x 1646λ    15 ns
ROM          12 x 16     1292λ x 1854λ    (illegible)
ROM          16 x 16     Not feasible
Multiplier   16 x 16     3513λ x 1568λ    80 ns

Table 1: ROM vs Multiplier Comparisons

We see that, barring a need for a 16-bit wide input for the ROM, its size is smaller than the
parallel multiplier. Also, our unique design of the ROM allows us to have two sets of table entries
interdigitated with an area increase of only 12%. This is essential if we are to compute both the
forward and inverse transforms using the same structure. Thus we see that not only is the ROM smaller
than the multiplier, but it is also over 4 times faster than the best parallel multiplier that was
available to us. The timing simulations are for the 2.0 µm technology parameters (λ = 1 µm).

3.2  Finite  Wordlength  Simulations 

For many video standards we need to ensure a minimum PSNR of 40 dB [16]. Fig. 3 illustrates the
architectural simulations which help in quantifying the truncation noise in the system by PSNR
computation. The blocks for the forward and inverse DCT are C architectural models which model
the finite precision arithmetic of the IIR structure for the 2-D DCT/IDCT computation, and help
in estimating the Peak and Average SNR. Simulations are performed for a set of different system
parameters, aimed at meeting the minimum SNR criterion while at the same time minimizing area and
maximizing speed.
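
A toy version of such an accuracy experiment is sketched below. It is not the authors' C model: it does not model the IIR structure, it simply truncates the coefficients of a floating-point 8 x 8 DCT to an assumed number of fractional bits, and it uses the common definition PSNR = 20 log10(255 / RMSE), which may differ in detail from (9). Its numbers are therefore not comparable to those reported for the chip; it only illustrates how wordlength and PSNR trade off.

```python
import math
import random

N = 8
# Orthonormal 1-D DCT basis: BASIS[k][n] = sqrt(2/N) * C(k) * cos((2n+1) k pi / 2N)
BASIS = [
    [math.sqrt(2.0 / N) * (1.0 / math.sqrt(2.0) if k == 0 else 1.0)
     * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)]
    for k in range(N)
]

def dct2(block):
    """Forward 2-D DCT of an N x N block (floating point)."""
    return [[sum(BASIS[k][m] * BASIS[l][n] * block[m][n]
                 for m in range(N) for n in range(N))
             for l in range(N)] for k in range(N)]

def idct2(coef):
    """Inverse 2-D DCT of an N x N coefficient block (floating point)."""
    return [[sum(BASIS[k][m] * BASIS[l][n] * coef[k][l]
                 for k in range(N) for l in range(N))
             for n in range(N)] for m in range(N)]

def truncate(value, frac_bits):
    """Drop everything below 2**-frac_bits (truncation toward minus infinity)."""
    scale = 1 << frac_bits
    return math.floor(value * scale) / scale

def psnr_after_truncation(blocks, frac_bits):
    """PSNR in dB of DCT -> truncate coefficients -> IDCT over a list of blocks."""
    squared_error = 0.0
    for block in blocks:
        coef = [[truncate(v, frac_bits) for v in row] for row in dct2(block)]
        rec = idct2(coef)
        squared_error += sum((rec[m][n] - block[m][n]) ** 2
                             for m in range(N) for n in range(N))
    rmse = math.sqrt(squared_error / (len(blocks) * N * N))
    return 20.0 * math.log10(255.0 / rmse)
```

In this simplified model each additional fractional bit buys roughly 6 dB of PSNR, which is the kind of relationship the wordlength simulations are used to quantify.
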


Figure  3:  Accuracy  simulation  block-outline 

Several input images (LENA, SAILBOAT, AIRPLANE, and random data) have been used in
these simulations. These images are 512 x 512 pixels, with each pixel specified by an 8-bit word
corresponding to 256 gray levels. As our image block size is 8 x 8 pixels, a single image yields
4096 blocks, for which the 2-D DCT-IDCT is computed and the PSNR statistics collected. The
highlights of these simulations are presented in Table 2.


Internal bus precision   Average SNR   Peak SNR
(illegible)              21.1 dB       27.3 dB
(illegible)              37.8 dB       44.0 dB
(illegible)              38.0 dB       44.2 dB

Table 2: Finite precision arithmetic simulation of IIR structure.

As we can see, the minimum PSNR requirements are adequately satisfied if we use a system bus
precision of 16 bits with a 12-bit wide ROM input wordlength. A 16-bit ROM input, however, increases
the ROM area by four times with almost the same PSNR.


4  VLSI  Design  and  Implementation 


The time-recursive architecture lends itself to
efficient VLSI implementations. To achieve a high-speed design (with minimum area) we use the
full-custom approach. All submodules have been designed with careful regard to area and speed
issues. Particular design care is taken to ensure that the critical path modules, such as the ROM
lookup and adder, are optimized. A highly hierarchical and modular strategy is employed in the chip
design. By employing such a design strategy, we not only reduce the design time and effort but also
improve reliability.


4.1  Design  and  Simulation  Tools/Methodology 

The VLSI layout editor MAGIC is used to implement the full-custom 2-D DCT/IDCT chip. The
various submodules needed for the chip design (ROM lookup, adder, latch, delay, multiplexer, and
inverter) are laid out first. These submodules are characterized and their functionality verified
before they are used as macro-cells. These macro-cells are instantiated and used in the higher-level
hierarchies such as the "1-D Channel" and "2-D Channel" modules, which in turn are larger macro-cells
in the next higher level. This hierarchical approach reduces the design effort considerably, as one
needs to consider only the tiling, placement, and routing of the modules at the current level.

Crystal and, primarily, Spice are used to perform the timing simulations. Crystal, a
semi-interactive program, is used to estimate the critical path in a submodule between a specified set
of input and output vectors. For a given change in the input vector, worst-case delays
are propagated to the output vectors. This gives us information about the critical paths. Using
this as a starting point, Spice is used to perform a detailed timing analysis of the critical paths of
the various macrocells. In our design, the slowest modules turned out to be the ROM lookup and the
carry lookahead adder.

Functionality verification is done using IRSIM, which is an event-driven logic-level simulator.
The logic-level simulations are performed at a much higher level of circuit abstraction, treating the
transistors as switches which are either ON or OFF. IRSIM is used to verify the functionality
at all hierarchy levels: starting from the bottommost level of the macrocells, through intermediate
levels such as the 1-D/2-D channel modules, up to the topmost level, namely the entire 2-D DCT/IDCT
chip.

4.2  Distributed  Arithmetic 

This  is  perhaps  one  of  the  most  critical  submodules  designed  in  this  project.  The  ROM  is  optimized 
both  for  speed  and  area.  The  area  optimization  is  especially  critical,  as  the  ROM  is  arrayed  34 
times  in  the  complete  chip  and  occupies  a  significant  percentage  of  the  chip  area.  Even  a  small 
reduction  in  ROM  area  would  affect  our  chip  area  statistics  considerably.  While  speedwise,  ROMs 
are  superior  to  multipliers,  their  area  increases  exponentially  with  the  input  word  size. 

The optimal system design, satisfying the accuracy requirements and minimizing area, required a
16-bit internal bus with a 12-bit ROM input wordlength. A straightforward implementation of
this ROM lookup table would need 2^12 (or 4096) rows of 16-bit words, which is too large to be
implemented. However, by using the partial sums method, as illustrated in Fig. 4, we can reduce the
size of the lookup table to 2^6 (or 64) rows, but this requires two separate lookup tables along with
a fast adder to combine the high and low order sums. Also, our ROM structure should have the
capability of computing the 2 products, for both forward and inverse transforms. We will see in
the subsequent paragraphs how the high and low order tables and interdigitation are employed to
craft a compact and fast ROM lookup.

The 12-bit input is split into two 6-bit words, $\mathrm{Inp}_H$ and $\mathrm{Inp}_L$. The multiplication is effected in
the following manner. The output, $\mathrm{Out} = C_1 \times \mathrm{Inp}$, is computed as:

$$\mathrm{Out} = 2^6 \times C_1 \times \mathrm{Inp}_H + C_1 \times \mathrm{Inp}_L,$$

where the input

$$\mathrm{Inp} = 2^6 \times \mathrm{Inp}_H + \mathrm{Inp}_L.$$

Figure 4: ROM Design Strategy ($\mathrm{Out}_{16} = C_1 \times \mathrm{Inp}_{12}$)


The two sub-products are precomputed with sufficient accuracy and stored in the ROM lookup
table. The output is formed by adding the sign-extended lower order product to the higher order
product. This is illustrated in Fig. 4 (b). We need two tables, each with only 2^6 or 64 rows. It
turns out that we need to store 16 bits for the upper precomputed product and 11 bits for the lower
product. The lower and higher order ROM entries are to be added with the proper shift, taking
into account the 2's complement representation.
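
The partial-sums lookup can be sketched in software as follows. This is an illustration under stated assumptions, not the chip's actual table contents: the coefficient cos(π/16), the output scaling of 4 fractional bits, and the function names are all chosen for the example. With these assumed values the table entries happen to fit in 16 and 11 bits respectively.

```python
import math

FRAC_BITS = 4                      # assumed output scaling: Out ~ C * Inp * 2**FRAC_BITS
COEFF = math.cos(math.pi / 16)     # assumed example multiplier coefficient

def build_tables(coeff, frac_bits=FRAC_BITS):
    """Precompute the two 64-row tables for one multiplier coefficient."""
    scale = 1 << frac_bits
    hi = []
    for pattern in range(64):
        # The upper 6 bits carry the sign of the 12-bit two's-complement input.
        signed = pattern - 64 if pattern >= 32 else pattern
        hi.append(round(coeff * signed * 64 * scale))
    # The lower 6 bits are an unsigned value 0..63.
    lo = [round(coeff * value * scale) for value in range(64)]
    return hi, lo

def rom_multiply(x, hi, lo):
    """Multiply a 12-bit signed input by the coefficient using two lookups and one add."""
    if not -2048 <= x <= 2047:
        raise ValueError("input must fit in 12-bit two's complement")
    pattern = x & 0xFFF
    return hi[pattern >> 6] + lo[pattern & 0x3F]
```

Because each table entry is rounded independently, the result can differ from the exactly rounded product by at most one least-significant bit; in exchange, 2 x 64 rows replace the 4096 rows of a single table.
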


4.2.1  Design  details 

Most ROMs are based on the precharged scheme, where the bit lines are precharged high and,
during the evaluate phase, selected lines are discharged according to the stored bit sequence. Our
ROM is based on a novel design which reduces the access time. The main components of the ROM
are a tree-based row decoder, memory cells, and a sense-amp. The ROM schematic is shown in Fig. 5.


Figure 5: ROM Multiplier Schematic (outputs $\mathrm{Out}_{16} = C_1 \times \mathrm{Inp}_{12}$ and $\mathrm{Out}_{16} = C_2 \times \mathrm{Inp}_{12}$)

6-64 Decoder: The 6-bit input lines are decoded and the appropriate ROM row-select line is
selected. Instead of a straightforward 6-bit decoder implementation, we use two 3-bit decoders and
an array of 64 AND gates. This technique helps reduce the layout complexity and also results in a
shorter access time. The VLSI layout of the 6-bit decoder is shown in Fig. 6.

Figure 6: Physical layout of the 6-bit decoder (377λ x 1292λ).

Lookup Table: The ROM unit cells which encode a 1 or 0 are shown in Fig. 7. These cells are
identical and measure 13λ x 16λ. Depending on whether a logic high or low is to be stored, we have
an n-channel transistor connected between the output line and either Vdd or GND, respectively.
The transistor gate is connected to the row-select line of the decoder. In our ROM table there are
in total (16 + 11) x 2 x 64 or 3456 unit cells. Each cell corresponds to one bit of storage. The pitch
of these cells is designed to be half that of the sense-amp (SA), which lets us place two cells for a
single SA. So by placing one set of SAs at the top of the table and another below, we are able to
interdigitate two lookup tables with minimal increase (only 12%) in area requirement.
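
The two-stage row decoding described above can be illustrated with a small behavioural sketch. It only models the logic (two 3-to-8 decoders followed by 64 two-input AND gates); the function names are illustrative and nothing here reflects the actual transistor-level layout.

```python
def decode_3to8(value):
    """One-hot 3-to-8 decode of a 3-bit value."""
    return [1 if i == value else 0 for i in range(8)]

def decode_6to64(address):
    """Row-select lines for a 6-bit address, built from two 3-bit decoders."""
    if not 0 <= address < 64:
        raise ValueError("address must be a 6-bit value")
    upper = decode_3to8((address >> 3) & 0b111)
    lower = decode_3to8(address & 0b111)
    # 64 two-input AND gates: row 8*i + j is active when upper[i] and lower[j] are both 1.
    return [upper[i] & lower[j] for i in range(8) for j in range(8)]
```

A direct decoder would need one six-input gate per row; splitting the address means each row needs only a two-input AND fed by two shared 3-bit decoders, which is consistent with the simpler layout noted in the text.
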

Figure 7: ROM unit-cells of size 13λ x 16λ, encoding bits 1 and 0.

Sense-Amp: The function of this simple module is to speed up the word lookup. It works
on the simple principle of precharging the bit-lines to an intermediate voltage between VDD and
GND. In this way, regardless of whether the bit-lines are turned on or off, the delay time is reduced.
The sense-amp schematic is shown in Fig. 8 (b). In the precharge phase of the clock (φ is low),
the p-MOS shorts the input and output of the inverter, which forces the bit-lines to the inverter
transition point at 2.5 volts (simulations have, however, indicated that this voltage is closer to 3
volts). In the evaluate phase, the p-MOS transistor turns off and the bit-line signal is latched at
the output through the simple butterfly-pass transistor. The SA measures 26λ x 94λ. The pitch of
the SA is twice that of the unit cell, allowing two such cells for every SA. By placing one set of SAs
at the bottom, and another at the top, the products for two coefficients are computed at the same
time.


4.2.2  ROM  Implementation 

The first ROM is designed in the regular manner: the individual cells such as the sense-amp,
3-bit decoders, 6-bit decoders, and 0/1 unit cells are designed in their various orientations and at
UP/DOWN positions. Once a complete ROM is assembled (with dummy coefficients), it is broken up into
subunits/cells, and their locations and arraying information are noted. Using a Perl script, new ROMs
are assembled by stitching the various subunits/cells into the appropriate locations. The crucial part
is the encoding of the interdigitated coefficients. The sine or cosine coefficients are computed using
double-precision floating point arithmetic on the Sun workstation, and given as input to the Perl
script. These numbers are converted to 2's complement binary, and routines are called to place the
"1-unitcell" or "0-unitcell" depending on the bit value at that particular location.
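
The conversion step can be sketched as follows. The original tool is a Perl script whose code is not given, so this Python fragment is only an illustration; the function names, the fixed-point scaling parameter, and the MSB-first ordering are assumptions.

```python
def to_twos_complement_bits(value, width):
    """Bits of an integer in `width`-bit two's complement, most significant bit first."""
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the requested width")
    pattern = value & ((1 << width) - 1)
    return [(pattern >> i) & 1 for i in range(width - 1, -1, -1)]

def cells_for_entry(product, width, frac_bits):
    """Map one precomputed table entry to a row of unit-cell names.

    `product` is the floating-point table entry; `frac_bits` is the assumed
    number of fractional bits kept when it is rounded to an integer.
    """
    fixed = round(product * (1 << frac_bits))
    return ["1-unitcell" if bit else "0-unitcell"
            for bit in to_twos_complement_bits(fixed, width)]
```

For example, `cells_for_entry(-0.5, 4, 2)` rounds to the integer -2, whose 4-bit pattern is 1110, and so yields three "1-unitcell" entries followed by one "0-unitcell".
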

The Perl script writes the placement information of the ROM subunits to a temporary file. The
script then invokes MAGIC as a child process, which does the actual job of putting together all the
modules to form the final ROM submodule. The automatically generated ROM is carefully tested
for functionality, design rule violations, and timing. To generate ROM lookup
tables with different multipliers, the process is repeated for all the channels. Multiplier coefficients
required for the various channels are computed and the ROM generator is invoked for each of these.

Figure 8: Sense-Amp. (a) Physical layout of SA (26λ x 94λ), (b) Corresponding circuit schematic.

The size of the basic ROM structure (with SA) which encodes two multiplier coefficients is
1292λ x 1854λ. If we include the adders to combine the high and low order products, the ROM/adder
assembly measures 2226λ x 1854λ. The combined ROM lookup-table and adder to effect
multiplication is shown in Fig. 9.


4.2.3  Timing 

Figure 9: Physical layout of ROM lookup-table with associated adders. Dimension is 2226λ x 1854λ.

To compute the propagation delay in the ROM lookup table, Spice is used. The circuit parasitics
are extracted using the extraction style for 1.2 µm or 2.0 µm. The timing analysis for the ROM
lookup-table extracted using the 2.0 µm technology scaling parameters (of Magic) is shown in Fig. 10.
The ROM input is pulsed to 5 V at 3 ns and the clock at 5 ns. The 'Row select' line is the output of
the 6-bit decoder which selects one of the words in the ROM table. The propagation delay is defined as
the time difference between the 50% points of the input and the 'Row select' lines. The bit line is
at 2.9 volts, and it discharges to 0.5 volts at 16.5 ns. The propagation delay for the ROM lookup
is 13.5 ns.

Figure 10: 2.0 µm ROM (plot of ROM propagation delay, 2.0 micron technology; horizontal axis: time in seconds)

4.3  Adder 

There  are  several  possible  designs  for  adders.  They  could  be  based  on  the  simple  majority  logic 
function  approach  [17,  18]  implemented  using  straightforward  combinational  gates.  Implementation 
can  employ  either  the  simple,  but  slow  ripple  carry  adder  or  utilize  the  fast,  but  complex  carry 
lookahead  (CLA)  method.  There  are  pros  and  cons  to  every  choice,  and  depending  upon  the 
application  one  should  choose  the  appropriate  design. 

We need to choose the design which would yield the least propagation delay, and at the same
time would not require elaborate clocking or precharge schemes or take up too much area. Carry
lookahead adders would give us the best performance in terms of delay, but it is not practical to
implement a CLA with more than 4 bits as it becomes too large. Ripple carry adders, on the other
hand, would yield compact layouts, but their speed would depend on the number of bits being
added.

Figure 11: Physical layout of the 16-bit carry-lookahead-ripple Adder. Module dimension is
1348λ x 355λ.

In our case, accuracy simulations indicated the need for a 16-bit wide internal bus. Clearly, a
simple ripple carry adder would be too slow for our requirements. At the same time, building a
CLA scheme for 16 bits is impractical. A good compromise that we chose was to implement a 4-bit
CLA, and connect four of these units in a ripple fashion to obtain a 16-bit carry
lookahead-cum-ripple adder. The basic 4-bit carry lookahead adder implementation is based on [17].
The adder module is shown in Fig. 11. This module has 704 transistors and measures 1348λ x 355λ.
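
The lookahead-cum-ripple organization can be illustrated with a bit-level behavioural sketch. It uses the textbook generate/propagate equations for a 4-bit CLA block and ripples the block carry through four blocks; it is not derived from the circuit of [17], and the function names are illustrative.

```python
def cla4(a, b, carry_in):
    """4-bit carry-lookahead block: returns (4-bit sum, carry out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]        # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]      # propagate
    c0 = carry_in
    # All carries are formed directly from g, p and c0 (no rippling inside the block).
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))
    return total, c4

def add16(a, b, carry_in=0):
    """16-bit adder: four 4-bit CLA blocks with the carry rippling between blocks."""
    result = 0
    carry = carry_in
    for block in range(4):
        shift = 4 * block
        nibble_sum, carry = cla4((a >> shift) & 0xF, (b >> shift) & 0xF, carry)
        result |= nibble_sum << shift
    return result, carry
```

In this organization the worst-case carry crosses only four block boundaries instead of sixteen bit positions, which is the compromise between speed and area described in the text.
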

4.3.1  Timing 

We expect the longest delay in the adder to be the signal propagated from carry-in to the MSB.
Here, as carry-in is always preset to 0/1, the longest delay is from the LSB to the MSB. Simulations
performed using 1.2 µm technology parameters indicate a maximum propagation delay of 9 ns. The
adder inputs are FFFF (hex) and 0000 (hex). 'InA1' is pulsed from 0 to 5 volts at 2 ns. 'Out16' drops
to 0 V after the propagation delay, as shown in Fig. 12.

4.4  Clocking 

The propagation delays of the ROM and adder help us decide retiming and clocking issues. Fig. 13
illustrates the implementation of the 1-D kernel SFG using a single-phase clock (with static latches)
or a two-phase non-overlapping clock (with dynamic latches). From both the speed and area viewpoints,
the two-phase clocking scheme turns out to be the better choice.

Figure 12: Adder

A two-phase clock permits us to use dynamic latches, which are simpler to design and
more compact than static latches. Two-phase clocks also give us the flexibility to retime
the SFG such that the delays in the various critical paths are equalized. This is illustrated in
Fig. 13 (a) and (b). In Fig. 13 (a), the propagation delay between any two subsequent latches is
either 3A+R or 2A, where A and R are the propagation delays of the adder and ROM. The maximum delay
between two subsequent latches is the critical path and determines the fastest possible clock
rate. In Fig. 13 (a) it is 3A+R. Knowing that the adder and ROM delays are about the same, the
critical path is retimed as shown in Fig. 13 (b). The maximum delay now is either A+R or 2A,
which means that the critical path delay is halved, and thus the maximum clocking rate is doubled.
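
As a worked check of the retiming argument, using only the latch-to-latch delays quoted above (3A + R before retiming, the larger of A + R and 2A after) and the stated approximation R ≈ A:

$$T_{(a)} = 3A + R \approx 4A, \qquad T_{(b)} = \max(A + R,\; 2A) \approx 2A, \qquad \frac{f_{(b)}}{f_{(a)}} = \frac{T_{(a)}}{T_{(b)}} \approx 2.$$

The factor of two holds exactly only when R = A. For a hypothetical R = 1.5A, chosen purely for illustration, the ratio would be 4.5A / 2.5A = 1.8, so the gain from retiming depends on how closely the ROM and adder delays are matched.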

4.5  Other  Submodules 

Various other macrocells which are needed in our implementation are described here. The half-latch,
delay, multiplexer, and inverter are some of the modules which have been designed.

Figure 13: Clock Speedup. (a) Single-phase clock, (b) Two-phase clocking scheme.


4.5.1  Half-latch  with  Reset 

The schematic for the half-latch is shown in Fig. 14, and the VLSI layout in Fig. 15. It is 16 bits
wide, corresponding to the internal bus precision. Depending on its location in the signal flow graph
(see Fig. 17), it latches on either φ1 or φ2. A reset control is also provided. To design the latch,
a single bit is laid out and tested with IRSIM and Spice for functionality and timing. The size
of the output inverter is made sufficiently large to drive the expected output load
in the module where it will be used. Once the testing is completed, the final module is assembled
by arraying the one-bit cell. Local distribution of all control signals is taken into account at the
1-bit design stage itself, so after the arraying the module is ready with all clock/control signals and
power rails already wired. Of course, the module is not fully ready until functional verification is
done again for the 16-bit wide latch.

Figure 14: Schematic of Half-latch with Reset Control Circuitry

Figure 15: Physical layout of Half-latch with Reset. Module dimension is 938λ x 127λ.

4.5.2  Delay,  Delay7  and  Delay8 

These are shift registers which act as clock delays of 1, 7, and 8 cycles. As before, they are
implemented by modifying the basic 1-bit unit to latch for an entire clock period. The 1-bit, 1-clock
period delay unit is arrayed 16 times to obtain a 16-bit wide bus, and this is further arrayed 7 or 8
times to get the final delay modules. The control/clock signals required by this module are ‘Reset’,
φ1, and φ2. The ‘delay’ module measures 938λ x 217λ, the ‘delay7’ measures 1028λ x 1549λ, and
the ‘delay8’ measures 1028λ x 1771λ.
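
The behavior of these delay modules can be sketched as a simple register chain. This is a behavioral model only, not the circuit: the class name `DelayLine` and its methods are hypothetical, and the two-phase clocking is collapsed into a single `clock` call per clock period.

```python
from collections import deque


class DelayLine:
    """Behavioral model of an n-cycle delay built from 1-cycle register stages."""

    def __init__(self, cycles, width=16):
        self.mask = (1 << width) - 1          # keep words within the bus width
        self.stages = deque([0] * cycles, maxlen=cycles)

    def reset(self):
        """Clear every stage, as the 'Reset' control would."""
        for i in range(len(self.stages)):
            self.stages[i] = 0

    def clock(self, word):
        """Advance one clock period: return the oldest word, shift in a new one."""
        out = self.stages[0]
        self.stages.append(word & self.mask)  # maxlen drops the oldest entry
        return out


delay7 = DelayLine(7)
outputs = [delay7.clock(n) for n in range(1, 11)]
print(outputs)
```

Each input word reappears at the output seven calls later, which is the role the ‘delay7’ module plays in the signal flow graph; `DelayLine(1)` and `DelayLine(8)` model the other two modules in the same way.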

4.5.3  Multiplexers 

These are needed to choose between two sets of signals. They are used after the ROM (see Fig. 17) and
at other locations in the SFG affected by a switch from computing the forward to the inverse transform.
The circuit schematic is illustrated in Fig. 16. The size of the multiplexer is 1271λ x 119λ.

Figure 16: Schematic of Multiplexer

4.6  1-D  Channel  Module 

The macrocells described so far are put together to form the 1-D channel module
at the next level of the hierarchy. The channel module implements the SFG of the kernel, as shown
in Fig. 17. Depending on the value stored in the ROM, this module computes the appropriate
DCT/IDCT transform coefficient. Magic ‘instantiates’ [16] every occurrence of these modules,
resulting in much faster layout and extraction of the module.

The  signal  flow  graph  implemented  by  this  module  is  shown  in  Fig.  17.  All  but  multiplier 
M3  (needed  only  for  inverse  transform  as  indicated  in  Fig.  1)  and  the  associated  multiplexers  and 
latches  are  included  in  this  module.  Since  only  one  set  of  M3  and  its  associated  circuitry  is  needed 
for  the  entire  1-D  module,  it  is  designed  separately. 

The physical layout of the 1-D channel is shown in Fig. 18. All eight channel modules
are identical except that they instantiate different ROM tables. The circuit has been laid out in
a manner that facilitates modular development. Inter-module connections are brought
to the edges of the blocks, where they connect with the other modules' wire segments when
these modules are tiled. By adopting such a methodology, we save considerably in design time and
effort, and at the same time, if the modules are designed properly (matching pitch), we also save in
interconnection area. The power rails, input/output, and other important control
signals are routed from top to bottom in each module. As the modules are tiled vertically, the routing of
all signals is done automatically. We only have to concern ourselves with feeding these signals to
the entire 1-D module, either from the top or the bottom, as local distribution of these signals is
already taken care of in the design of the channel module.

Figure 17: 1-D IIR SFG with clocks and latches

The eight 1-D channel modules are tiled one over the other to form the complete 1-D DCT. To
compute the IDCT, the additional M3 and its associated delay units are attached separately to the
1-D module, below Channel 7. The physical layout of this module is shown in Fig. 19.

4.7  2-D  Channel  Module 

This module is designed along similar lines as the previous 1-D channel module. The SFG
(shown in Fig. 20) is almost the same, except that 8-block delay units are present in both loops. This
facilitates computation of 8 blocks of data in a time-displaced parallel fashion. The same concept
can also be viewed from a systolic point of view. The VLSI layout of this module is shown in Fig. 21.
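
The effect of placing longer delay units in a recursive loop can be illustrated with a generic sketch. The recursion below is a hypothetical accumulator, y[n] = x[n] + y[n - delay], and is not the kernel of Fig. 20; it only shows that a loop with an 8-sample delay processes eight interleaved data streams independently, each as if it had its own 1-delay loop.

```python
def recursive_loop(samples, delay):
    """Run y[n] = x[n] + y[n - delay] with zero initial state."""
    state = [0] * delay               # models a chain of `delay` registers
    out = []
    for n, x in enumerate(samples):
        y = x + state[n % delay]      # value written `delay` cycles earlier
        state[n % delay] = y
        out.append(y)
    return out


# Eight hypothetical streams; stream k carries the constant k + 1.
streams = [[k + 1] * 4 for k in range(8)]
# Interleave them one sample per clock: s0[0], s1[0], ..., s7[0], s0[1], ...
interleaved = [streams[k][i] for i in range(4) for k in range(8)]

result = recursive_loop(interleaved, delay=8)
per_stream = [result[k::8] for k in range(8)]            # de-interleave
expected = [recursive_loop(s, delay=1) for s in streams]  # one loop per stream
print(per_stream == expected)
```

The same hardware loop is thus shared by eight computations that are offset from each other by one clock, which is one way to read the phrase "time-displaced parallel fashion".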

4.8  CSA  and  Control  Signal  Distribution 

This was the last module designed. It takes the 8 parallel 16-bit words generated by the 1-D module
and feeds them serially to the 2-D module. The Circular Shift Array (CSA) module serves another
important function: storing the 1-D IDCT coefficients of the first row, which are required for computation
of the inverse at the second stage.

Figure 18: VLSI layout of 1-D channel. The module measures 2742λ x 8029λ.

There are several control signals for this module: to read data from the 1-D module, to latch
it in and shift it out serially to the 2-D module, to latch in data needed for computing the
inverse, and to hold it until required. These control signals are Lat_shft_C and Lat_inv_C, as given
in Table 3. The 1-D channel outputs are latched into the CSA during the time the 1-D module is being
reset. At the same time, the 2-D DCT is also being computed. In the clock period
when the 1-D channel modules are being reset, the 2-D channel modules are not clocked. This is
done to ensure synchronization. For the same reason, the CSA’s second set of shift registers, which
circulate the first row of 1-D DCT coefficients, is also not clocked in that period.
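
The parallel-to-serial role of the CSA can be sketched behaviorally as follows. The class and method names are hypothetical, the output order (channel 0 first) is an assumption, and the control signals are reduced to two method calls; the sketch shows only the latch-then-shift idea and the circulation of stored words.

```python
class CircularShiftArray:
    """Behavioral sketch: latch 8 parallel words, then shift them out serially."""

    def __init__(self, channels=8):
        self.regs = [0] * channels

    def latch(self, words):
        """Capture one word from each 1-D channel in a single step."""
        if len(words) != len(self.regs):
            raise ValueError("one word per channel is required")
        self.regs = list(words)

    def shift(self):
        """Emit the word at the head and rotate it to the tail (circular shift)."""
        head = self.regs[0]
        self.regs = self.regs[1:] + [head]
        return head


csa = CircularShiftArray()
parallel_words = [10, 11, 12, 13, 14, 15, 16, 17]   # hypothetical 1-D outputs
csa.latch(parallel_words)
serial = [csa.shift() for _ in range(8)]
print(serial)
```

Because the shift is circular, the register contents return to their latched order after eight shifts, so a stored row can be presented repeatedly without being reloaded.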

Routing of the control signals is an important issue, as quite a few control signals
need to be distributed to various clocked modules such as latches and delays, and to the multiplexers.
The basic idea is to distribute the master control signals to the high-level cells such as the 1-D/2-D channel
modules and use a buffer to generate the local control signals, which are then distributed to all
modules within that particular submodule. This is essentially a multi-level tree distribution of the
control signals. By employing such a scheme, we not only minimize skew but also
improve rise and fall times. This scheme is particularly relevant to the clock signal distribution.

Figure 19: The layout of the additional M3 module for the inverse transform

The ‘rst’ control signal is distributed to all clocked modules, as they need
to be reset between blocks. The ‘fwd’ signal, which determines whether the forward or inverse
DCT is computed, is routed to all the multiplexers. Modules that require these control signals
also need their complements, which are generated at the local buffer, so it is not necessary to route the
complements of the control signals on a global chip scale. Routing of these control signals to each
bit slice is built into the sub-module design.

4.9  2-D  DCT/IDCT  Chip 

Using all the cells already described (1-D, 2-D, and CSA), the final chip is assembled. The
floorplan of the chip is shown in Fig. 22. In the floorplan one can identify the various cells that
have been mentioned earlier in this description. The physical layout of the 2-D DCT/IDCT chip
is shown in Fig. 23. The chip description and statistics are tabulated in Table 3 and Table 4.

Figure 20: 2-D IIR SFG with latches and clock details
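
The layout dimensions quoted throughout are in scalable layout units (lambda), so the die figures 24550 x 27094 can be converted to physical size with a short sketch. The value of 0.6 micron per lambda is an assumption (half of the 1.2 micron feature size, the usual scalable-rule convention); it is not stated explicitly in the text.

```python
LAMBDA_UM = 0.6  # assumed: lambda is half the 1.2 micron feature size


def lambda_to_mm(length_in_lambda, lambda_um=LAMBDA_UM):
    """Convert a layout dimension in lambda units to millimetres."""
    return length_in_lambda * lambda_um / 1000.0


width_mm = lambda_to_mm(24550)
height_mm = lambda_to_mm(27094)
area_mm2 = width_mm * height_mm
print(f"{width_mm:.2f} mm x {height_mm:.2f} mm = {area_mm2:.1f} mm^2")
```

Under this assumption the result is close to the 240 mm2 chip area quoted for the design, which supports reading the dimensions as lambda units.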

5  Comparisons 

Some designs published in the literature are compared with our design in Table
5. The designs in [5, 8] are for low-bit-rate CODECs that operate at about 15 MHz. Both designs
use the butterfly architecture, so there is no flexibility in the transform size N, and transposition
is required. Both employ a distributed arithmetic implementation. The DCT/IDCT chip in [6]
used a silicon compiler to design the whole system in 0.8µ triple-metal technology. A
Booth multiplier is used instead of distributed arithmetic, and transposition is also employed. This
chip can operate at 50 MHz, but the bit throughput rate is unknown.

Our design appears to have many pins, but this is for ease of debugging; in fact, only 38 pins
are active. In particular, a FIFO buffer can be placed at the output to reduce the output pin count.
The chip has a regular, highly modular, and fully pipelined structure, hence full-custom design is possible by

Function          Compute 2-D DCT/IDCT
Mode Selection    Active High ‘Fwd’ for Forward
Clock Pins        Phi 1: φ1 & Phi 2: φ2
Reset Pin (1D)    Active high ‘Rst’ pin
Reset Pin (2D)    Active high ‘Rst_2d_C’ pin
CSA Control       ‘Lat_shft_C’ and ‘Lat_inv_C’
2-D Inverse       ‘Inv_load_C’ high for last block
Input Pins        ChipIn15, ..., ChipIn0 (16 pins)
Test Output       CSA: 1DOUT15, ..., 1DOUT0 (16 pins)
2-D Output        K0OUT16 - K0OUT1, ..., K7OUT16 - K7OUT1
Power Rails       ‘Vdd!’ & ‘GND!’

Table 3: 2D-DCT/IDCT Chip Description


Technology          1.2µ CMOS N-well
Die Dimensions      24550λ x 27094λ
Chip Area           240 mm2
# of Transistors    320,000
Speed               50 MHz
Data rate           400 Mb/s

Table 4: 2D-DCT/IDCT Chip Statistics


                    Sun-Chen [5]        Fujiwara et al [8]   Miyazaki et al [6]   Srinivasan-Liu
Function            16 x 16 DCT         8x8 DCT/IDCT         8x8 DCT/IDCT         8x8 DCT/IDCT
Technology          2.0µ                1.0µ                 0.8µ triple metal    1.2µ double-metal
Chip Area           8.3 x 8.1 mm2       10.7 x 10.2 mm2      12.8 x 12.6 mm2      240 mm2
# of Transistors    73,000              156,000              180,000              320,000
Pads                25 active pads      68 PLCC pkg.         72 PGA               176 (Active pads ~38)
Max. Speed          15.1 MHz            15 MHz               50 MHz               50* MHz or 400* Mbps
Structure           Butterfly (irreg)   Irregular            Irregular            Regular (modular)

Table 5: Comparisons (* The ROM simulation results were based on the 2µ CMOS technology.
Therefore, the actual figures should be higher).

Figure 21: 2-D Module Layout (2784λ x 10672λ)

using Magic. Based on a conservative estimate, it can operate at 50 MHz in 1.2µ CMOS
technology. If 0.8µ technology is used, it can be much faster.

A  detailed  comparison  is  given  in  Table  5. 

6  Conclusion 

In this paper, we have presented a VLSI implementation of a high-performance, high-speed 2-D
DCT/IDCT chip. It is a full-custom implementation employing a highly modular and hierarchical
design strategy. Distributed arithmetic is used for fast and compact multipliers. A non-overlapping
two-phase clocking scheme leads to a faster and more compact layout of the kernel. Architectural
simulations are conducted to choose system parameters that ensure adequate accuracy while
minimizing chip area. The 2-D DCT/IDCT chip dimensions are 24550λ x 27094λ and its area is
240 mm2 based on 1.2µ technology. The pin count is 176, and the chip has over 320,000 transistors.
Timing simulations performed using Spice indicate a clock frequency of 50 MHz, corresponding to a
data throughput of 400 Mb/s. We have shown that VLSI design based on the class of time-recursive
algorithms and architectures can easily meet high-speed requirements. In comparison to the
existing designs, our approach offers many advantages that can be further explored for even higher
performance. This chip has been submitted for fabrication in 1.2µ CMOS N-well double-metal
single-poly technology.

Figure 22: Floorplan of 2-D DCT/IDCT Chip.

Figure 23: Two Dimensional DCT/IDCT Chip. Chip measures 24550λ x 27094λ. Area is 240 mm2.